Spatial Analysis Tutorial¶

Author: Yiming Yang, Rimte Rocher
Date: 2022-03-07
Notebook Source: regress_out.ipynb

This tutorial runs analysis on a 10x Visium mouse brain section dataset with Pegasus.

In [1]:
import numpy as np
import pandas as pd
import pegasusio as io
import pegasus as pg

Load data¶

You can download the data at https://storage.googleapis.com/terra-featured-workspaces/Cumulus/mouse_brain_10x.tar.gz.

After downloading, unzip the tar ball file and load the data into memory:

In [2]:
data = io.read_input('mouse_brain_10x', file_type='visium')
2022-03-07 14:30:29,082 - pegasusio.readwrite - INFO - visium file 'mouse_brain_10x' is loaded.
2022-03-07 14:30:29,084 - pegasusio.readwrite - INFO - Function 'read_input' finished in 1.37s.
In [3]:
data
Out[3]:
MultimodalData object with 1 UnimodalData: 'mm10-visium'
    It currently binds to SpatialData object mm10-visium

SpatialData object with n_obs x n_vars = 2702 x 32285
    Genome: mm10; Modality: visium
    It contains 1 matrix: 'X'
    It currently binds to matrix 'X' as X

    obs: 'in_tissue', 'array_row', 'array_col'
    var: 'featureid'
    obsm: 'X_spatial'
    uns: 'genome', 'modality'
    img: 'sample_id', 'image_id', 'data', 'scale_factor', 'spot_diameter'

Quality Control (QC)¶

Calculate statistics for QC¶

First calculate QC metrics before filtration. Notice that mouse mito gene names start with mt-.

In [4]:
pg.qc_metrics(data, mito_prefix='mt-')

Then we can view the metrics. For example, the code below shows the 2.5% and 97.5% quantiles for number of genes:

In [5]:
np.percentile(data.obs['n_genes'], [2.5, 97.5])
Out[5]:
array([2917.675, 8664.475])

QC filtration¶

Based on the quantiles above, we filter the data by number of genes between 2.5% and 97.5%, and percent of mito gene expression below 20%:

In [6]:
pg.qc_metrics(data, min_genes = 2917, max_genes = 8664, mito_prefix='mt-', percent_mito=20)
In [7]:
pg.qcviolin(data, plot_type='gene', dpi=100)
In [8]:
pg.qcviolin(data, plot_type='count', dpi=100)
In [9]:
pg.qcviolin(data, plot_type='mito', dpi=100)

Now do the actual filteration below:

In [10]:
pg.filter_data(data)
2022-03-07 14:30:30,259 - pegasusio.qc_utils - INFO - After filtration, 2229 out of 2702 cell barcodes are kept in UnimodalData object mm10-visium.
2022-03-07 14:30:30,260 - pegasus.tools.preprocessing - INFO - Function 'filter_data' finished in 0.20s.

And identify robust genes:

In [11]:
pg.identify_robust_genes(data)
2022-03-07 14:30:30,555 - pegasus.tools.preprocessing - INFO - After filtration, 21680/32285 genes are kept. Among 21680 genes, 20184 genes are robust.
2022-03-07 14:30:30,555 - pegasus.tools.preprocessing - INFO - Function 'identify_robust_genes' finished in 0.29s.

Downstream analysis to get clusters¶

In [12]:
pg.log_norm(data)
pg.highly_variable_features(data)
pg.pca(data)
pg.neighbors(data)
pg.leiden(data)
pg.umap(data)
2022-03-07 14:30:30,775 - pegasus.tools.preprocessing - INFO - Function 'log_norm' finished in 0.21s.
2022-03-07 14:30:30,801 - pegasus.tools.hvf_selection - INFO - Function 'estimate_feature_statistics' finished in 0.03s.
2022-03-07 14:30:30,856 - pegasus.tools.hvf_selection - INFO - 2000 highly variable features have been selected.
2022-03-07 14:30:30,857 - pegasus.tools.hvf_selection - INFO - Function 'highly_variable_features' finished in 0.08s.
2022-03-07 14:30:31,535 - pegasus.tools.preprocessing - INFO - Function 'pca' finished in 0.68s.
2022-03-07 14:30:31,825 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.29s.
2022-03-07 14:30:31,912 - pegasus.tools.nearest_neighbors - INFO - Function 'calculate_affinity_matrix' finished in 0.09s.
2022-03-07 14:30:32,000 - pegasus.tools.graph_operations - INFO - Function 'construct_graph' finished in 0.05s.
2022-03-07 14:30:32,220 - pegasus.tools.clustering - INFO - Leiden clustering is done. Get 11 clusters.
2022-03-07 14:30:32,225 - pegasus.tools.clustering - INFO - Function 'leiden' finished in 0.31s.
2022-03-07 14:30:32,227 - pegasus.tools.nearest_neighbors - INFO - Found cached kNN results, no calculation is required.
2022-03-07 14:30:32,228 - pegasus.tools.nearest_neighbors - INFO - Function 'get_neighbors' finished in 0.00s.
2022-03-07 14:30:32,232 - pegasus.tools.visualization - INFO - Using umap kNN graph because number of cells 2229 is smaller than 4096 or knn_indices is not provided.
UMAP(min_dist=0.5, random_state=0, verbose=True)
Mon Mar  7 14:30:32 2022 Construct fuzzy simplicial set
OMP: Info #273: omp_set_nested routine deprecated, please use omp_set_max_active_levels instead.
Mon Mar  7 14:30:35 2022 Finding Nearest Neighbors
Mon Mar  7 14:30:38 2022 Finished Nearest Neighbor Search
Mon Mar  7 14:30:41 2022 Construct embedding
Epochs completed:   0%|            0/500 [00:00]
Mon Mar  7 14:30:46 2022 Finished embedding
2022-03-07 14:30:46,865 - pegasus.tools.visualization - INFO - Function 'umap' finished in 14.64s.

Run the code below to show the UMAP plot of cells colored by cluster labels generated by Leiden algorithm on their PCA embedding:

In [13]:
pg.scatter(data, attrs='leiden_labels')

DE analysis¶

Run the function below to perform Differential Expression analysis:

In [14]:
pg.de_analysis(data, cluster='leiden_labels')
2022-03-07 14:30:47,914 - pegasus.tools.diff_expr - INFO - CSR matrix is converted to CSC matrix. Time spent = 0.3630s.
2022-03-07 14:31:04,257 - pegasus.tools.diff_expr - INFO - MWU test and AUROC calculation are finished. Time spent = 16.3431s.
2022-03-07 14:31:04,329 - pegasus.tools.diff_expr - INFO - Sufficient statistics are collected. Time spent = 0.0722s.
2022-03-07 14:31:04,422 - pegasus.tools.diff_expr - INFO - Differential expression analysis is finished.
2022-03-07 14:31:04,423 - pegasus.tools.diff_expr - INFO - Function 'de_analysis' finished in 16.87s.

Cell type annotation¶

This is to annotate cluster-specific cell types with preset mouse brain markers:

In [15]:
celltype_dict = pg.infer_cell_types(data, markers = 'mouse_brain')
celltype_dict
Out[15]:
{'1': [name: GABAergic neuron; score: 0.98; average marker percentage: 87.03%; strong support: (Gad1+,94.86%),(Gad2+,89.07%); weak support: (Slc32a1+,77.17%)],
 '2': [name: Glutamatergic neuron; score: 0.75; average marker percentage: 90.70%; strong support: (Slc17a7+,100.00%),(Neurod6+,83.39%),(Neurod2+,88.70%),
  name: GABAergic neuron; score: 0.58; average marker percentage: 84.88%; strong support: (Gad1+,94.35%); weak support: (Slc32a1+,75.42%)],
 '3': [name: GABAergic neuron; score: 1.00; average marker percentage: 93.95%; strong support: (Gad1+,95.16%),(Gad2+,95.97%),(Slc32a1+,90.73%),
  name: Ependyma; score: 0.66; average marker percentage: 23.39%; weak support: (Ccdc153+,23.39%)],
 '4': [name: Oligodendrocyte; score: 1.00; average marker percentage: 96.65%; strong support: (Mbp+,100.00%),(Plp1+,100.00%),(Mog+,100.00%),(Olig1+,99.52%),(Olig2+,85.17%),(Sox10+,95.22%),
  name: Microglia; score: 0.94; average marker percentage: 72.85%; strong support: (C1qb+,81.34%),(Ctss+,81.34%),(Csf1r+,77.51%); weak support: (P2ry12+,51.20%),
  name: Choroid Coch; score: 0.73; average marker percentage: 18.66%; weak support: (Tgfbi+,18.66%)],
 '5': [name: Glutamatergic neuron; score: 0.75; average marker percentage: 94.69%; strong support: (Slc17a7+,100.00%),(Neurod6+,94.69%),(Neurod2+,89.37%)],
 '6': [name: Oligodendrocyte; score: 0.99; average marker percentage: 86.11%; strong support: (Mbp+,100.00%),(Plp1+,100.00%),(Mog+,94.61%),(Olig1+,98.53%); weak support: (Olig2+,56.86%),(Sox10+,66.67%)],
 '7': [name: GABAergic neuron; score: 1.00; average marker percentage: 89.96%; strong support: (Gad1+,90.34%),(Gad2+,95.45%),(Slc32a1+,84.09%),
  name: Oligodendrocyte; score: 1.00; average marker percentage: 91.38%; strong support: (Mbp+,100.00%),(Plp1+,100.00%),(Mog+,96.59%),(Olig1+,99.43%),(Olig2+,69.89%),(Sox10+,82.39%),
  name: Astrocyte; score: 0.50; average marker percentage: 93.47%; strong support: (Aqp4+,89.20%),(Gja1+,97.73%)],
 '8': [name: Astrocyte; score: 0.68; average marker percentage: 81.60%; strong support: (Gja1+,98.05%); weak support: (Aqp4+,85.06%),(F3+,61.69%)],
 '9': [name: Glutamatergic neuron; score: 0.50; average marker percentage: 91.84%; strong support: (Slc17a7+,100.00%),(Neurod2+,83.67%)],
 '10': [name: Mural; score: 1.00; average marker percentage: 73.94%; strong support: (Rgs5+,80.99%),(Acta2+,66.90%),
  name: Choroid Coch; score: 1.00; average marker percentage: 40.85%; strong support: (Tgfbi+,40.85%),
  name: Ependyma; score: 1.00; average marker percentage: 51.41%; strong support: (Ccdc153+,51.41%),
  name: Smooth muscle cell; score: 1.00; average marker percentage: 67.96%; strong support: (Vtn+,87.32%),(Colec12+,48.59%),
  name: Astrocyte; score: 0.96; average marker percentage: 79.58%; strong support: (Aqp4+,90.14%),(Gja1+,95.07%); weak support: (F3+,68.31%),(Prex2+,64.79%),
  name: Endothelial; score: 0.83; average marker percentage: 57.39%; strong support: (Dcn+,69.01%),(Id1+,86.62%); weak support: (Flt1+,59.86%),(Xdh+,14.08%),
  name: Microglia; score: 0.73; average marker percentage: 88.03%; strong support: (C1qb+,94.37%),(Ctss+,87.32%); weak support: (Csf1r+,82.39%),
  name: OPC; score: 0.50; average marker percentage: 59.15%; strong support: (Pdgfra+,59.15%),
  name: Fibroblast; score: 0.50; average marker percentage: 69.01%; strong support: (Dcn+,69.01%)],
 '11': [name: Glutamatergic neuron; score: 0.75; average marker percentage: 92.05%; strong support: (Slc17a7+,100.00%),(Neurod6+,84.62%),(Neurod2+,91.54%)]}
In [16]:
cluster_names = pg.infer_cluster_names(celltype_dict)
pg.annotate(data, name='anno', based_on='leiden_labels', anno_dict=cluster_names)
In [17]:
pg.scatter(data, 'anno', legend_loc='on data')

Besides the UMAP plot, Pegasus also provides spatial function to generate spatial plots. Below is the spatial plot of cells colored by their cell types:

In [18]:
pg.spatial(data, 'anno')